This module extends code contained in Coronavirus_Statistics_v003.Rmd to include 2020-2021 data. This file includes the latest code for analyzing all-cause death data from CDC Weekly Deaths by Jurisdiction. CDC maintains data on deaths by week, age cohort, and state in the US. Downloaded data are unique by state, epidemiological week, year, age, and type (actual vs. predicted/projected).
These data are known to have a lag between death and reporting, and the CDC back-correct to report deaths at the time the death occurred even if the death is reported in following weeks. This means totals for recent weeks tend to run low (lag), and the CDC run a projection of the expected total number of deaths given the historical lag times. Per other analysts on the internet, there is currently significant supra-lag, with lag times much longer than historical averages causing CDC projected deaths for recent weeks to be low.
Functions for the previous module are available in Coronavirus_Statistics_Functions_CDC_v003.R and Coronavirus_Statistics_Functions_Shared_v003.R. The appropriate subset of these functions are copied and edited for use in this module. The code leverages tidyverse and a variable mapping file throughout:
# All functions assume that tidyverse and its components are loaded and available
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# If the same function is in both files, use the version from the more specific source
source("./Generic_Added_Utility_Functions_202105_v001.R")
# source("./Coronavirus_Statistics_Functions_USAF_v003.R")
The main function is readRunCDCAllCause(), which performs multiple tasks:
STEP 0: Optionally, downloads the latest data file from CDC STEP 1: Reads and processes a data file has been downloaded from CDC to local
STEP 2: Extract relevant data from a processed state-level COVID Tracking Project list
STEP 3: Basic plots of the CDC data
STEP 4: Basic excess-deaths analysis
STEP 5: Create cluster-level aggregate plots
STEP 6: Create state-level aggregate plots
STEP 7: Create age-cohort aggregate plots
STEP 8: Returns a list of key data frames, modeling objects, named cluster vectors, etc.
The main function to be edited is readRunCDCAllCause():
# Function to read and run the CDC all-cause deaths analysis
readRunCDCAllCause <- function(loc,
startYear,
curYear,
weekThru,
startWeek,
lst,
cvDeathThru,
epiMap=readFromRDS("epiMonth"),
agePopData=readFromRDS("usPopBucket2020"),
cdcPlotStartWeek=startWeek,
periodKeep=paste0(startYear, "-", curYear-1),
url="https://data.cdc.gov/api/views/y5bj-9g5w/rows.csv?accessType=DOWNLOAD",
dlData=isFALSE(file.exists(paste0(dir, loc))),
dir="./RInputFiles/Coronavirus/",
stateNoCheck=c()
) {
# FUNCTION ARGUMENTS:
# loc: the CDC .csv file name (without path)
# startYear: the starting year in the CDC data
# curYear: the current analyis year in the CDC data
# weekThru: how many weeks of the current year are the data valid thru?
# startWeek: the starting week to use for cumulative sum of difference in expected all-cause deaths
# lst: a state clustering process output list
# epiMap: a mapping file of ew-month-quarter that mas epiweek (ew) to an appropriate month and quarter
# agePopData: data containing US population as age (fct) - pop (int)
# cvDeathThru: the date to use for pulling the CV death data
# cdcPlotStartWeek: start week for CDC plots (10 is March which avoids a 1-week February outlier)
# periodKeep: the period of previous data in the CDC all-cause deaths file (should be kept regardless)
# url: the url for downloading CDC data
# dlData: should CDC data be downloaded? default is to not overwrite an existing file
# dir: the CDC .csv directory (will use paste0(dir, loc) as the file location)
# stateNoCheck: vector of states that should not error out for failing suppression checks
# STEP 0: Download CDC data if requested
if (dlData) fileDownload(fileName=paste0(dir, loc), url=url)
return()
# STEP 1: Read and process the CDC data
cdc <- readProcessCDC(loc, weekThru=weekThru, periodKeep=periodKeep, fDir=dir, stateNoCheck=stateNoCheck)
# STEP 2: Create the key data required for using state-level clusters
clusterList <- helperKeyStateClusterMetrics(lst)
# STEP 3: Generate plots of the processed CDC data
cdcBasicPlots(cdc, clustVec=clusterList$clData, stateExclude=stateNoCheck)
# Exclude stateNoCheck states
cat("\nPlots will be run after excluding stateNoCheck states\n")
cdcUse <- cdc %>% filter(!(state %in% stateNoCheck))
# STEP 4: Full US excess deaths
list_allUS <- cdcCohortAnalysis(cohortName="all US",
df=cdcUse,
curYear=curYear,
startYear=startYear,
startWeek=startWeek,
plotTitle="All-cause US total deaths",
predActualPlotsOnePage=TRUE
)
# cat("\n\n\n*** THROUGH STEP 4 ***\n\n\n")
# STEP 5: Generate cluster-level aggregate plots
clusterAgg <- cdcAggregateSummary(df=cdcUse,
critVar="state",
critSubsets=clusterList$stateCluster,
startWeek=startWeek,
critListNames=paste0("cluster ", 1:length(clusterList$stateCluster)),
factorCritList=FALSE,
popData=clusterList$pop,
cvDeathData=clusterList$deaths,
idVarName="cluster"
)
# cat("\n\n\n*** THROUGH STEP 5 ***\n\n\n")
# STEP 6: Generate state-level aggregate data, then plot
stateAgg <- cdcAggregateSummary(df=cdcUse,
critVar="state",
critSubsets=setdiff(names(clusterList$clData), stateNoCheck),
startWeek=startWeek,
idVarName="state",
subListNames=setdiff(names(clusterList$clData), stateNoCheck),
showAllPlots=FALSE
)
# cat("\n\n\n*** THROUGH STEP 6a ***\n\n\n")
helperKeyStateExcessPlots(df=stateAgg,
epiMonth=epiMap,
cvDeaths=lst$consolidatedPlotData,
startWeek=cdcPlotStartWeek,
cvDeathDate=as.Date(cvDeathThru),
subT=paste0("CDC data through week ", weekThru, " of ", curYear)
)
# cat("\n\n\n*** THROUGH STEP 6b ***\n\n\n")
# STEP 7: Generate age-level aggregate data, then plot
ageAgg <- cdcAggregateSummary(df=cdcUse,
critVar="age",
critSubsets=levels(cdc$age),
startWeek=startWeek,
idVarName="age",
subListNames=levels(cdc$age),
showAllPlots=TRUE
)
helperKeyAgeExcessPlots(df=ageAgg,
epiMonth=epiMap,
cvDeaths=lst$consolidatedPlotData,
popData=agePopData,
startWeek=cdcPlotStartWeek,
cvDeathDate=as.Date(cvDeathThru),
subT=paste0("CDC data through week ", weekThru, " of ", curYear)
)
# STEP 8: Return a list of key files
list(cdc=cdc,
clusterList=clusterList,
allUSAgg=list_allUS$preds,
clusterAgg=clusterAgg,
stateAgg=stateAgg,
ageAgg=ageAgg
)
}
The data are initially run to download the latest CDC data:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210623.csv"
readRunCDCAllCause(loc=cdcLoc)
## NULL
The data are inspected to check that they are sufficient for the next steps:
remapVars=c('Jurisdiction'='fullState',
'Week Ending Date'='weekEnding',
'State Abbreviation'='state',
'Age Group'='age',
'Number of Deaths'='deaths',
'Time Period'='period',
'Year'='year',
'Week'='week'
)
cdcRaw <- fileRead(paste0("./RInputFiles/Coronavirus/", cdcLoc), col_types="cccddcdcccc") %>%
colRenamer(vecRename=remapVars)
cdcRaw
## # A tibble: 197,545 x 11
## fullState weekEnding state year week age deaths period Type Suppress
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 Alabama 01/10/2015 AL 2015 1 25-44~ 67 2015-2~ Predic~ <NA>
## 2 Alabama 01/17/2015 AL 2015 2 25-44~ 49 2015-2~ Predic~ <NA>
## 3 Alabama 01/24/2015 AL 2015 3 25-44~ 55 2015-2~ Predic~ <NA>
## 4 Alabama 01/31/2015 AL 2015 4 25-44~ 59 2015-2~ Predic~ <NA>
## 5 Alabama 02/07/2015 AL 2015 5 25-44~ 47 2015-2~ Predic~ <NA>
## 6 Alabama 02/14/2015 AL 2015 6 25-44~ 59 2015-2~ Predic~ <NA>
## 7 Alabama 02/21/2015 AL 2015 7 25-44~ 41 2015-2~ Predic~ <NA>
## 8 Alabama 02/28/2015 AL 2015 8 25-44~ 47 2015-2~ Predic~ <NA>
## 9 Alabama 03/07/2015 AL 2015 9 25-44~ 59 2015-2~ Predic~ <NA>
## 10 Alabama 03/14/2015 AL 2015 10 25-44~ 57 2015-2~ Predic~ <NA>
## # ... with 197,535 more rows, and 1 more variable: Note <chr>
cdcRaw %>%
group_by(state, Type) %>%
summarize(deaths=sum(deaths, na.rm=TRUE)) %>%
arrange(-deaths)
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
## # A tibble: 108 x 3
## # Groups: state [54]
## state Type deaths
## <chr> <chr> <dbl>
## 1 US Predicted (weighted) 18787183
## 2 US Unweighted 18736023
## 3 CA Predicted (weighted) 1804503
## 4 CA Unweighted 1802307
## 5 TX Predicted (weighted) 1367691
## 6 FL Predicted (weighted) 1367661
## 7 FL Unweighted 1366098
## 8 TX Unweighted 1363538
## 9 PA Predicted (weighted) 898425
## 10 PA Unweighted 897361
## # ... with 98 more rows
cdcRaw %>%
mutate(subUnit=ifelse(state=="US", "whole", "components")) %>%
group_by(subUnit, Type, year) %>%
summarize(deaths=sum(deaths, na.rm=TRUE), .groups="drop") %>%
ggplot(aes(x=year)) +
geom_col(aes(y=deaths), fill="lightblue") +
geom_text(aes(y=deaths/2, label=paste0(round(deaths/1000000, 2), " MM")), size=3) +
facet_grid(Type~subUnit) +
labs(x=NULL, y="Deaths", title="Raw CDC Summed Deaths")
Back of the envelope, the data appear reasonable for use in refining code.
The readrProcessCDC() function is updated:
ageLevels <- c("Under 25 years", "25-44 years", "45-64 years",
"65-74 years", "75-84 years", "85 years and older"
)
periodLevels <- c("2015-2019", "2020", "2021")
yearLevels <- 2015:2021
# Function to read and process raw CDC all-cause deaths data
readProcessCDC <- function(fName,
weekThru,
periodKeep=c("2015-2019", "2020"),
fDir="./RInputFiles/Coronavirus/",
col_types="ccciicdcccc",
renameVars=remapVars, # Update this approach
maxSuppressAllowed=10,
stateNoCheck=c()
) {
# FUNCTION ARGUMENTS:
# fName: name of the downloaded CDC data file
# weekThru: any record where week is less than or equal to weekThru will be kept
# periodKeep: any record where period is in periodKeep will be kept
# fDir: directory name for the downloaded CDC data file
# col_types: variable type by column in the CDC data (passed to readr::read_csv())
# renameVars: named vector for variable renaming of type c("Existing Name"="New Name")
# maxSuppressAllowed: maximum number of data suppressions (must be in current week/year) to avoid error
# stateNoCheck: vector of states that do NOT have suppression errors thrown
# STEP 1: Read the CSV data
cdcRaw <- fileRead(paste0(fDir, fName), col_types=col_types)
# glimpse(cdcRaw)
# STEP 2: Rename the variables for easier interpretation
cdcRenamed <- cdcRaw %>%
colRenamer(vecRename=renameVars) %>%
colMutater(selfList=list("weekEnding"=lubridate::mdy))
# glimpse(cdcRenamed)
# STEP 3: Convert to factored data
cdcFactored <- cdcRenamed %>%
colMutater(selfList=list("age"=factor), levels=ageLevels) %>%
colMutater(selfList=list("period"=factor), levels=periodLevels) %>%
colMutater(selfList=list("year"=factor), levels=yearLevels)
# glimpse(cdcFactored)
# STEP 4: Filter the data to include only weighted deaths and only through the desired time period
cdcFiltered <- cdcFactored %>%
rowFilter(lstFilter=list("Type"="Predicted (weighted)")) %>%
filter(period %in% all_of(periodKeep) | week <= weekThru)
# glimpse(cdcFiltered)
# STEP 4a: Check that all suppressed data and NA deaths have been eliminated
cat("\n\n *** Data suppression checks *** \n")
checkProblems <- cdcFiltered %>%
mutate(problem=(!is.na(Suppress) | is.na(deaths)),
curWeek=(!(year %in% all_of(periodKeep)) & week==weekThru),
noCheck=state %in% all_of(stateNoCheck)
) %>%
group_by(noCheck, state, problem, curWeek) %>%
summarize(n=n(), deaths=specNA(sum)(deaths), .groups="drop")
errorProblems <- checkProblems %>%
filter(problem)
print(errorProblems)
errorProblems <- errorProblems %>%
group_by(noCheck, curWeek) %>%
summarize(n=sum(n), .groups="drop")
print(errorProblems)
nPrev <- errorProblems %>% filter(!noCheck, !curWeek) %>% pull(n) %>% sum()
nCur <- errorProblems %>% filter(!noCheck, curWeek) %>% pull(n) %>% sum()
if (nPrev > 0) stop(paste0("\n\n", nPrev, " records not from current week/year with problems. Fix and retry\n"))
if (nCur > maxSuppressAllowed) {
cat("\n\n", nCur, " records from current week/year with problems, exceeds tolerance of ", maxSuppressAllowed)
stop("\nFix and retry\n")
}
cat("\n\nData suppression checks passed\n\n")
# STEP 5: Remove any NA death fields, delete the US record, convert YC to be part of NY
cdcProcessed <- cdcFiltered %>%
rowFilter(lstExclude=list("state"=c("US", "PR"), "deaths"=c(NA))) %>%
mutate(state=ifelse(state=="YC", "NY", state),
fullState=ifelse(state %in% c("NY", "YC"), "New York State (NY plus YC)", fullState)
) %>%
group_by(fullState, weekEnding, state, year, week, age, period, Type, Suppress) %>%
arrange(!is.na(Note)) %>%
summarize(n=n(), deaths=sum(deaths), Note=first(Note), .groups="drop") %>%
ungroup() %>%
checkUniqueRows(uniqueBy=c("state", "year", "week", "age"))
glimpse(cdcProcessed)
# STEP 5a: Check control levels for key variables in processed file
cat("\nCheck Control Levels and Record Counts for Processed Data:\n")
cdcCheck <- cdcProcessed %>% mutate(n=1, n_deaths_na=ifelse(is.na(deaths), 1, 0))
purrr::walk(list(c("age"), c("period", "year", "Type"), c("period", "Suppress"), c("period", "Note")),
.f=function(x) {
cat("\n\nChecking variable combination:", x, "\n")
checkControl(cdcCheck, groupBy=x, useVars=c("n", "n_deaths_na", "deaths"), fn=specNA(sum))
}
)
p1 <- checkControl(cdcCheck,
groupBy=c("state"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=fct_reorder(state, deaths), y=deaths)) +
geom_col(fill="lightblue") +
geom_text(aes(y=deaths, label=paste0(round(deaths/1000), "k")), hjust=0, size=3) +
coord_flip() +
labs(y="Total deaths", x=NULL, title="Total deaths by state in all years in processed file")
print(p1)
p2 <- checkControl(cdcCheck,
groupBy=c("year", "week"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=week, y=deaths)) +
geom_line(aes(group=year, color=year)) +
labs(title="Deaths by year and epidemiological week", x="Epi week", y="US deaths") +
scale_color_discrete("Year") +
lims(y=c(0, NA))
print(p2)
# STEP 6: Return the processed data file
cdcProcessed
}
cdc_temp <- readProcessCDC(cdcLoc, weekThru=17, stateNoCheck=c("NC"))
##
##
## *** Data suppression checks ***
## # A tibble: 2 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE NC TRUE FALSE 72 NA
## 2 TRUE NC TRUE TRUE 6 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 72
## 2 TRUE TRUE 6
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 91,537
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10735 0 369164
## 2 25-44 years 13656 0 902390
## 3 45-64 years 16793 0 3549786
## 4 65-74 years 16783 0 3558139
## 5 75-84 years 16790 0 4401133
## 6 85 years and older 16780 0 5681860
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691180
## 2 2015-2019 2016 Predicted (weighted) 14445 0 2723236
## 3 2015-2019 2017 Predicted (weighted) 14404 0 2801986
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830372
## 5 2015-2019 2019 Predicted (weighted) 14415 0 2844025
## 6 2020 2020 Predicted (weighted) 14837 0 3433405
## 7 2021 2021 Predicted (weighted) 4672 0 1138268
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72028 0 13890799
## 2 2020 <NA> 14837 0 3433405
## 3 2021 <NA> 4672 0 1138268
##
##
## Checking variable combination: period Note
## # A tibble: 9 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72028 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only ~ 13194 0 2.96e6
## 3 2020 Data in recent weeks are incomplete. Only ~ 531 0 2.31e5
## 4 2020 Weighted numbers of deaths are 20% or more~ 280 0 6.00e4
## 5 2020 Weights may be too low to account for unde~ 18 0 9.85e3
## 6 2020 <NA> 814 0 1.69e5
## 7 2021 Data in recent weeks are incomplete. Only ~ 4469 0 1.10e6
## 8 2021 Data in recent weeks are incomplete. Only ~ 14 0 9.65e2
## 9 2021 Data in recent weeks are incomplete. Only ~ 189 0 3.58e4
cdc_temp
## # A tibble: 91,537 x 12
## fullState weekEnding state year week age period Type Suppress n
## <chr> <date> <chr> <fct> <int> <fct> <fct> <chr> <chr> <int>
## 1 Alabama 2015-01-10 AL 2015 1 Under 2~ 2015-~ Predic~ <NA> 1
## 2 Alabama 2015-01-10 AL 2015 1 25-44 y~ 2015-~ Predic~ <NA> 1
## 3 Alabama 2015-01-10 AL 2015 1 45-64 y~ 2015-~ Predic~ <NA> 1
## 4 Alabama 2015-01-10 AL 2015 1 65-74 y~ 2015-~ Predic~ <NA> 1
## 5 Alabama 2015-01-10 AL 2015 1 75-84 y~ 2015-~ Predic~ <NA> 1
## 6 Alabama 2015-01-10 AL 2015 1 85 year~ 2015-~ Predic~ <NA> 1
## 7 Alabama 2015-01-17 AL 2015 2 Under 2~ 2015-~ Predic~ <NA> 1
## 8 Alabama 2015-01-17 AL 2015 2 25-44 y~ 2015-~ Predic~ <NA> 1
## 9 Alabama 2015-01-17 AL 2015 2 45-64 y~ 2015-~ Predic~ <NA> 1
## 10 Alabama 2015-01-17 AL 2015 2 65-74 y~ 2015-~ Predic~ <NA> 1
## # ... with 91,527 more rows, and 2 more variables: deaths <dbl>, Note <chr>
The process now reads and processes the file and runs diagnostics for sanity checks.
The helperKeyStateClusterMetrics() function is also updated:
# Convert the output of a state-level clustering list to population, membership, and deaths
helperKeyStateClusterMetrics <- function(lst) {
# FUNCTION ARGUMENTS:
# lst: the list containing the outputs of the state clustering routing
# Extract the relevant elements from lst
plotClusters <- lst[["plotDataList"]][["plotClusters"]]
dfFull <- lst[["plotDataList"]][["dfFull"]] %>% select(state, date, pop, tot_deaths, new_deaths)
# Create the aggregated data
df <- joinFrames(plotClusters, dfFull, keyJoin="state")
# Create reported cvDeaths by cluster
dfWeekly <- df %>%
mutate(year=lubridate::epiyear(date), week=lubridate::epiweek(date), cluster=as.character(cluster)) %>%
group_by(state, cluster, year, week) %>%
summarize(pop=max(pop),
tot_deaths=specNA(max)(tot_deaths),
new_deaths=specNA(sum)(new_deaths),
.groups="drop"
) %>%
group_by(cluster, year, week) %>%
summarize(pop=sum(pop),
tot_deaths=specNA(sum)(tot_deaths),
new_deaths=specNA(sum)(new_deaths),
.groups="drop"
) %>%
checkUniqueRows(uniqueBy=c("cluster", "year", "week"))
list(deaths=dfWeekly, useClusters=lst[["useClusters"]])
}
# Get a state-level clustering file
cdc_daily_210528 <- readFromRDS("cdc_daily_210528")
clusterList_temp <- helperKeyStateClusterMetrics(cdc_daily_210528)
##
## *** File has been checked for uniqueness by: cluster year week
clusterList_temp
## $deaths
## # A tibble: 512 x 6
## cluster year week pop tot_deaths new_deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 2020 1 1415872 NA NA
## 2 3 2020 2 1415872 NA NA
## 3 3 2020 3 1415872 NA NA
## 4 3 2020 4 16576414 0 0
## 5 3 2020 5 16576414 0 0
## 6 3 2020 6 16576414 0 0
## 7 3 2020 7 16576414 0 0
## 8 3 2020 8 16576414 0 0
## 9 3 2020 9 16576414 1 1
## 10 3 2020 10 16576414 16 15
## # ... with 502 more rows
##
## $useClusters
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS
## 4 9 9 9 6 4 8 5 5 6 9 3 7 4 7 7 7 6 9 8 5 3 5 4 4 9
## MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 4 6 7 4 3 8 7 9 8 6 7 3 9 8 9 7 9 6 4 6 3 3 4 6 4
clusterList_temp$deaths %>%
pivot_longer(-c("cluster", "year", "week")) %>%
filter(name %in% c("tot_deaths", "new_deaths"), !is.na(value)) %>%
ggplot(aes(x=week, y=value)) +
geom_line(aes(group=cluster, color=cluster)) +
facet_grid(name~year, scales="free_y") +
labs(x="Epi Week", y="Deaths (total or incremental)", title="Evolution of deaths by cluster")
Data appear to be usable for the next steps in CDC processing. The function cdcBasicPlots() is updated:
# Basic plots of the CDC data, including by a state-level cluster (passed as argument)
cdcBasicPlots <- function(df,
weekThru=NULL,
curYear=NULL,
p5Years=curYear,
clustVec=NULL,
stateExclude=c()
) {
# FUNCTION ARGUMENTS:
# df: a processed CDC data file
# weekThru: week of the current year that data are thru (NULL means infer from data)
# curYear: current year (NULL means infer from data)
# p5Years: years for plotting deaths by week/cohort (defaults to only current year)
# clustVec: clustering vector with names as state abbreviations (NULL means no plots by cluster)
# stateExclude: vector of states to exclude from the analysis
# Get the week and year if passed as NULL
if (is.null(curYear)) { curYear <- df %>% pull(year) %>% as.character() %>% as.integer() %>% max() }
if (is.null(weekThru)) { weekThru <- df %>% filter(year==curYear) %>% pull(week) %>% max() }
# Filter to exclude stateExclude
df <- df %>%
filter(!(state %in% stateExclude))
# Update subtitle appropriately
subT <- NULL
if (length(stateExclude) > 0) subT <- paste0("Excludes ", paste0(stateExclude, collapse=", "))
# Plot of total deaths by year (50 states plus DC)
p1 <- df %>%
group_by(year) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
ggplot(aes(x=year, y=deaths/1000)) +
geom_col(fill="lightblue") +
geom_text(aes(y=deaths/2000, label=round(deaths/1000))) +
labs(title=paste0("CDC Deaths for 50 states plus DC through week ", weekThru, " of ", curYear),
x="",
y="Deaths (000s)"
)
if (!is.null(subT)) p1 <- p1 + labs(subtitle=subT)
print(p1)
# Plot of total deaths by week by year
p2 <- df %>%
group_by(year, week) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
ggplot(aes(x=week, y=deaths, group=year, color=year)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="Week of Year", y="Deaths", title="US deaths per week by year")
if (!is.null(subT)) p2 <- p2 + labs(subtitle=subT)
print(p2)
# Plot of total deaths by year by age cohort
p3 <- df %>%
group_by(year, week, age) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
ggplot(aes(x=week, y=deaths, group=year, color=year)) +
geom_line() +
ylim(c(0, NA)) +
facet_wrap(~age) +
labs(x="Week of Year", y="Deaths", title="US deaths per week by year")
if (!is.null(subT)) p3 <- p3 + labs(subtitle=subT)
print(p3)
# Plots of total deaths by week by year by cluster
if (!is.null(clustVec)) {
p4 <- df %>%
mutate(cluster=clustVec[state]) %>%
group_by(year, week, cluster) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
ggplot(aes(x=week, y=deaths, group=year, color=year)) +
geom_line() +
ylim(c(0, NA)) +
facet_wrap(~cluster, scales="free_y") +
labs(x="Week of Year",
y="Deaths",
title="US deaths per week by year"
)
if (is.null(subT)) p4 <- p4 + labs(subtitle="Facetted by state cluster")
else p4 <- p4 + labs(subtitle=paste0("Facetted by state cluster\n", subT))
print(p4)
p5 <- df %>%
mutate(cluster=clustVec[state]) %>%
filter(year %in% p5Years) %>%
group_by(year, age, week, cluster) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
mutate(weekUse=week+53*(as.integer(as.character(year))-curYear)) %>%
ggplot(aes(x=weekUse, y=deaths, fill=age)) +
geom_col(position="stack") +
ylim(c(0, NA)) +
facet_wrap(~cluster, scales="free_y") +
labs(x="Week of Year",
y="Deaths",
title=paste0("US deaths per week (Week 1 is start of year ", curYear, ")")
)
if (is.null(subT)) p5 <- p5 + labs(subtitle="Facetted by state cluster")
else p5 <- p5 + labs(subtitle=paste0("Facetted by state cluster\n", subT))
print(p5)
}
}
cdcBasicPlots(cdc_temp, clustVec=clusterList_temp$useClusters, stateExclude="NC", p5Years=2019:2021)
The function cdcCohortAnalysis() is updated:
# Function to subset CDC data
subsetCDC <- function(df,
critFilter=vector("list", 0),
showPlot=TRUE,
plotTitle=NULL,
curYear=2020:2021,
prevYears=paste0("2015-", min(curYear)-1)
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed CDC data
# critFilter: a filtering list of the form variable=c(possibleValues)
# showPlot: boolean, whether to show plots of deaths by week and year
# plotTitle: the title to be used for the plot (NULL will use a default title)
# curYear: the current year (will be excluded from the mean and standard deviation plot)
# Filter such that only matches to critFilter are included
for (xNum in seq_len(length(critFilter))) {
df <- df %>%
filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
}
# Create a default title if none provided
if (is.null(plotTitle)) plotTitle <- "All-cause deaths for filtered cohort"
# All deaths by week and year for the filtered cohort
allDeath <- df %>%
group_by(year, week) %>%
summarize(deaths=sum(deaths), .groups="drop") %>%
mutate(weekfct=factor(week), yearint=as.integer(as.character(year)))
# Show plots if requested
if (showPlot) {
# Plot of all deaths for the filtered cohort by week and year
p1 <- allDeath %>%
ggplot(aes(x=week, y=deaths, color=year, group=year)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="Week", y="Deaths", title=plotTitle)
print(p1)
# Plot of mean and standard deviation from previous years
p2 <- allDeath %>%
filter(!(year %in% curYear)) %>%
group_by(week) %>%
summarize(mean=mean(deaths), sd=sd(deaths), .groups="drop") %>%
ggplot(aes(x=week)) +
geom_line(aes(y=mean), color="red", size=2) +
geom_ribbon(aes(ymin=mean-sd, ymax=mean+sd), fill="pink", alpha=0.5) +
labs(x="Week",
y="Deaths",
title=paste0(prevYears, " ", plotTitle),
subtitle=paste0("Line is ",
prevYears,
" annual mean deaths, ribbon is +/- 1 standard deviation"
)
) +
ylim(c(0, NA))
print(p2)
}
# Return the processed data frame
allDeath
}
# Function to run the linear model to predict CDC deaths by week and year
cdcRegression <- function(df,
curYear=2020:2021,
startYear=2015
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed CDC data
# curYear: the current year(s)
# startYear: the year that is defined as the zero-point for the regression
# Very basic linear model for deaths vs. week and year
df %>%
filter(!(year %in% curYear)) %>%
lm(deaths ~ weekfct + I(yearint-startYear) + 0, data=.)
}
# Function to make predictions based on the regression
cdcPrediction <- function(df,
lmReg
) {
# FUNCTION ARGUMENTS:
# df: the data frame or tibble containing the data
# lmReg: the linear regression model
# Predictions and deltas
allPred <- df %>%
mutate(oldweekfct=weekfct, weekfct=factor(ifelse(weekfct==53, 52, weekfct))) %>%
mutate(pred=predict(lmReg, newdata=.), delta=(deaths-pred)) %>%
group_by(year) %>%
arrange(week) %>%
mutate(cumDelta=cumsum(delta), cumPred=cumsum(pred), weekfct=oldweekfct) %>%
select(-oldweekfct) %>%
ungroup()
# Return the predictions and deltas
allPred
}
# Function to make plots of the predictions
cdcPredictedvsActual <- function(df,
cohortName,
curYear=2020:2021,
prevYears=paste0("2015-", min(curYear)-1),
startWeek=1,
showCurrentPredvsPrev=TRUE,
showActualsvsPred=TRUE,
showActualsvsPredRatio=TRUE,
showExcessDeaths=TRUE,
predActualPlotsOnePage=FALSE
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing actual and predicted deaths
# cohortName: cohort name to use for plot titles
# curYear: the current year
# prevYears: the previous years
# startWeek: week to use for starting the counts of cumulative excess deaths
# showCurrentPredvsPrev: boolean, whether to plot current year predictions vs. previous years
# showActualsvsPred: boolean, whether to plot actuals vs. predictions by year
# showActualsvsPredRatio: boolean, whether to plot actuals vs. predictions ratios by yrea
# showExcessDeaths: boolean, whether to plot estimated excess deaths
# predActualPlotsOnePage: boolean, whether to place the predicted vs. actual plots all on one page
if (showCurrentPredvsPrev) {
# Current year predictions vs. previous year actuals
p1 <- df %>%
ggplot(aes(x=week)) +
geom_line(data=~filter(., !(year %in% curYear)), aes(y=deaths, color=year, group=year)) +
geom_line(data=~filter(., year %in% curYear), aes(y=pred, group=year, color=year), lty=2, size=1) +
ylim(c(0, NA)) +
labs(x="Week",
y="Deaths",
title=paste0("All-cause deaths for ", cohortName, " cohort"),
subtitle=paste0("Solid lines are actual ",
prevYears,
", dashed line is predicted ",
if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear
)
)
if (!predActualPlotsOnePage) print(p1)
}
if (showActualsvsPred) {
# Plot of predictions and actuals facetted by year - deaths
p2 <- df %>%
select(year, week, deaths, pred) %>%
pivot_longer(c(deaths, pred)) %>%
mutate(name=ifelse(name=="deaths", "Actual", "Predicted")) %>%
ggplot(aes(x=week)) +
geom_line(aes(y=value, color=name, group=name)) +
ylim(c(0, NA)) +
facet_wrap(~year) +
labs(x="Week",
y="Deaths",
title=paste0("All-cause deaths for ", cohortName, " cohort"),
subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
) +
scale_color_discrete("Deaths")
if (!predActualPlotsOnePage) print(p2)
}
if (showActualsvsPredRatio) {
# Plot of predictions and actuals facetted by year - ratios
p3 <- df %>%
mutate(rat=deaths/pred) %>%
ggplot(aes(x=week)) +
geom_line(aes(y=rat)) +
facet_wrap(~year) +
geom_hline(yintercept=1, lty=2) +
labs(x="Week",
y="Ratio (Actual vs. Predicted Deaths)",
title=paste0("Actual vs. predicted deaths for ", cohortName, " cohort"),
subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
)
if (!predActualPlotsOnePage) print(p3)
}
if (showExcessDeaths) {
# Plot of excess deaths after a starting week
p4 <- df %>%
select(-cumDelta, -cumPred) %>%
filter(week >= startWeek) %>%
group_by(year) %>%
arrange(week) %>%
mutate(cumDelta=cumsum(delta)) %>%
ggplot(aes(x=week)) +
geom_line(aes(y=cumDelta)) +
facet_wrap(~year) +
geom_hline(yintercept=0, lty=2) +
labs(x="Week",
y=paste0("Deaths vs. prediction (cumulative from week ", startWeek, ")"),
title=paste0("Cumulative estimated excess deaths starting week ",
startWeek,
"\nfor ",
cohortName,
" cohort"
),
subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
)
if (!predActualPlotsOnePage) print(p4)
}
# Create a one-page summary
if (predActualPlotsOnePage) {
if (!exists("p1", inherits=FALSE)) p1 <- ggplot()
if (!exists("p2", inherits=FALSE)) p2 <- ggplot()
if (!exists("p3", inherits=FALSE)) p3 <- ggplot()
if (!exists("p4", inherits=FALSE)) p4 <- ggplot()
gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2)
}
}
# Integrated function with all steps
cdcCohortAnalysis <- function(cohortName,
df=cdcProcessed,
critFilter=vector("list", 0),
plotTitle=NULL,
curYear=2020:2021,
startYear=2015,
prevYears=paste0(startYear, "-", min(curYear)-1),
startWeek=1,
showSubsetPlots=TRUE,
showPredActualPlots=TRUE,
predActualPlotsOnePage=FALSE
) {
# FUNCTION ARGUMENTS:
# cohortName: the name of the cohort
# df: the processed CDC death data
# critFilter: the filtering criteria in the form of a named list "variable"=c("possibleValues")
# plotTitle: title to be used for plot in subsetCDC()
# curYear: the current year(s) for the death data
# startYear: the starting year for the death data
# prevYears: the previous years for the death data (used for making predictions)
# startWeek: the starting week for counting excess deaths
# showSubsetPlots: boolean, whether to show the two plots in subsetCDC()
# showPredActualPlots: boolean, whether to show the four plots in cdcPredictedvsActual()
# predActualPlotsOnePage: boolean, whether to place the predicted vs. actual plots all on one page
# Subset the data (will show both plots by default)
allDeath_cohort <- subsetCDC(df=df,
critFilter=critFilter,
showPlot=showSubsetPlots,
plotTitle=plotTitle,
curYear=curYear,
prevYears=prevYears
)
# Create the linear model for the subsetted data
lm_cohort <- cdcRegression(df=allDeath_cohort, curYear=curYear, startYear=startYear)
# Apply the predictions to the data
allPred_cohort <- cdcPrediction(df=allDeath_cohort, lmReg=lm_cohort)
# Plots for predicted vs. actual (will show all plots by default)
cdcPredictedvsActual(df=allPred_cohort,
cohortName=cohortName,
curYear=curYear,
prevYears=prevYears,
startWeek=startWeek,
showCurrentPredvsPrev=showPredActualPlots,
showActualsvsPred=showPredActualPlots,
showActualsvsPredRatio=showPredActualPlots,
showExcessDeaths=showPredActualPlots,
predActualPlotsOnePage=predActualPlotsOnePage
)
# Return a list containing the lm model and the filtered data
list(lmReg=lm_cohort, preds=allPred_cohort)
}
list_allUS_temp <- cdcCohortAnalysis(cohortName="all US",
df=cdc_temp %>% filter(!(state %in% c("NC"))),
curYear=2020:2021,
startYear=2015,
startWeek=1,
plotTitle="All-cause US total deaths",
predActualPlotsOnePage=TRUE
)
list_allUS_temp
## $lmReg
##
## Call:
## lm(formula = deaths ~ weekfct + I(yearint - startYear) + 0, data = .)
##
## Coefficients:
## weekfct1 weekfct2 weekfct3
## 56785.0 57133.8 55731.8
## weekfct4 weekfct5 weekfct6
## 54810.2 54443.2 54506.8
## weekfct7 weekfct8 weekfct9
## 54138.2 53497.6 53296.2
## weekfct10 weekfct11 weekfct12
## 53414.8 52522.0 52125.0
## weekfct13 weekfct14 weekfct15
## 51404.8 51357.6 51011.0
## weekfct16 weekfct17 weekfct18
## 49969.0 49346.2 49298.0
## weekfct19 weekfct20 weekfct21
## 48537.8 48023.8 47864.8
## weekfct22 weekfct23 weekfct24
## 47480.4 47945.0 47485.0
## weekfct25 weekfct26 weekfct27
## 47381.6 47239.2 47446.8
## weekfct28 weekfct29 weekfct30
## 47179.0 46793.4 46694.0
## weekfct31 weekfct32 weekfct33
## 46772.4 46904.0 46694.6
## weekfct34 weekfct35 weekfct36
## 46582.2 46928.2 47150.4
## weekfct37 weekfct38 weekfct39
## 47418.6 47437.4 47544.6
## weekfct40 weekfct41 weekfct42
## 48562.0 48367.6 49012.8
## weekfct43 weekfct44 weekfct45
## 49172.0 49497.6 49792.8
## weekfct46 weekfct47 weekfct48
## 50453.4 50525.8 50980.0
## weekfct49 weekfct50 weekfct51
## 52046.6 52429.8 53166.0
## weekfct52 I(yearint - startYear)
## 53725.2 760.6
##
##
## $preds
## # A tibble: 330 x 9
## year week deaths weekfct yearint pred delta cumDelta cumPred
## <fct> <int> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2015 1 59578 1 2015 56785. 2793. 2793. 56785.
## 2 2016 1 53867 1 2016 57546. -3679. -3679. 57546.
## 3 2017 1 57719 1 2017 58306. -587. -587. 58306.
## 4 2018 1 64008 1 2018 59067. 4941. 4941. 59067.
## 5 2019 1 56359 1 2019 59827. -3468. -3468. 59827.
## 6 2020 1 58025 1 2020 60588. -2563. -2563. 60588.
## 7 2021 1 84172 1 2021 61349. 22823. 22823. 61349.
## 8 2015 2 58974 2 2015 57134. 1840. 4633. 113919.
## 9 2016 2 53615 2 2016 57894. -4279. -7958. 115440.
## 10 2017 2 58982 2 2017 58655 327. -260. 116961.
## # ... with 320 more rows
The function cdcAggregateSummary() is updated:
# Function to create data and plots for an aggregate (cluster in this case)
cdcAggregateSummary <- function(df,
critVar,
critSubsets,
startWeek,
idVarName=critVar,
curYear=2020:2021,
startYear=2015,
subListNames=NULL,
critListNames=subListNames,
factorCritList=!is.null(critListNames),
popData=NULL,
cvDeathData=NULL,
showAllPlots=TRUE,
showStep1Plots=showAllPlots,
showStep3Plots=showAllPlots
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed CDC all-cause deaths data
# critVar: the main variable supplied to critFilter
# critSubsets: the subsets used for critVar
# startWeek: the starting week for analysis (will assume no excess deaths prior to startWeek or curYear)
# idVarName: name to be used for the identifying variable when excess deaths data are combined
# curYear: current year(s) of CDC deaths data
# subListNames: names to be used in place of numbers for idVarName column of excessFull
# (NULL means use numbers)
# critListNames: names to be used for describing the key cohorts (NULL means use numbers)
# factorCritList: boolean, whether to factor the variable that results from critListNames
# startYear: starting year of CDC deaths data
# popData: population data file, unique and matching by idVarName (NULL means no population plots)
# cvDeathData: coronavirus deaths data file, unique and matching by idVarName
# NULL means no plots of excess all-cause deaths vs reported coronavirus deaths
# showAllPlots: boolean, whether to show all the plots
# showStep1Plots: boolean, whether to show a 1-page plot summary for each element of critSubsets
# showStep3Plots: boolean, whether to show a 1-page cross-element summary
# STEP 0: Initialize list to store results for each combination of critSubsets
tmpList <- vector("list", length(critSubsets))
# STEP 1: Run the process for each combination of critSubsets (produce plots if showStep1Plots is TRUE)
for (ctr in seq_along(critSubsets)) {
critUse <- list(critSubsets[[ctr]])
names(critUse) <- critVar
useName <- if(is.null(critListNames)) paste0("Iteration: ", ctr) else critListNames[ctr]
tmpList[[ctr]] <- cdcCohortAnalysis(cohortName=useName,
df=df,
critFilter=critUse,
plotTitle="",
curYear=curYear,
startYear=startYear,
startWeek=startWeek,
showSubsetPlots=FALSE,
showPredActualPlots=showStep1Plots,
predActualPlotsOnePage=showStep1Plots
)
}
# STEP 1a: Add names if requeated
if (!is.null(subListNames)) names(tmpList) <- subListNames
# STEP 2: Create the excess file
excessFull <- map_dfr(.x=tmpList, .f=function(x) { x[["preds"]] }, .id=idVarName)
# STEP 2a: Convert idVarName to factor if critListName is passed
if (factorCritList) {
excessFull <- excessFull %>%
mutate(!!idVarName:=factor(get(idVarName), levels=critListNames))
}
# STEP 3: Produce a plot of excess deaths by week by idVarName
if (showStep3Plots) {
p1 <- excessFull %>%
filter(year %in% curYear) %>%
ggplot(aes(x=week, y=delta)) +
geom_line(aes_string(group=idVarName, color=idVarName)) +
labs(title=paste0("Actual minus predicted deaths by week (",
if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear,
")"
),
x="Week",
y="Actual minus predicted deaths"
) +
facet_wrap(~year) +
scale_color_discrete(idVarName)
print(p1)
}
# STEP 4: Augment the data to include population (if popData is not NULL) and plot "per million"
if (!is.null(popData)) {
# Add population data to excessFull (only variables should be idVarName and pop)
excessFull <- excessFull %>%
left_join(select_at(popData, vars(all_of(c(idVarName, "pop")))), by=idVarName)
# Plot of excess deaths per million by week by cluster
p2 <- excessFull %>%
filter(year %in% curYear) %>%
ggplot(aes(x=week, y=1000000*delta/pop)) +
geom_line(aes(group=cluster, color=cluster)) +
labs(title=paste0("Actual minus predicted deaths by week (",
if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear,
")"
),
subtitle="Per million people, per week",
x="Week",
y="Actual minus predicted deaths (per million people)"
) +
scale_color_discrete(idVarName) +
facet_wrap(~year) +
geom_hline(aes(yintercept=0), lty=2)
print(p2)
}
# STEP 5: Augment the data to show plot of excess all-cause deaths vs. reported coronavirus deaths
if (!is.null(cvDeathData)) {
# Add the reported coronavirus deaths total
excessFull <- excessFull %>%
left_join(select_at(cvDeathData, vars(all_of(c(idVarName, "year", "week", "cvDeaths")))),
by=c(idVarName, "yearint"="year", "week")
) %>%
mutate(cvDeaths=ifelse(is.na(cvDeaths) | !(year %in% curYear), 0, cvDeaths))
# Plot for total excess deaths and reported coronavirus deaths by idVarName by week
p3 <- excessFull %>%
filter(year %in% curYear) %>%
mutate(useWeek=week+53*(yearint-min(curYear))) %>%
select_at(vars(all_of(c(idVarName, "year", "week", "useWeek", "cvDeaths", "delta")))) %>%
pivot_longer(c(cvDeaths, delta), names_to="source", values_to="deaths") %>%
ggplot(aes(x=useWeek, y=deaths)) +
geom_line(aes(group=source,
color=c("cvDeaths"="Reported COVID", "delta"="Excess All-Cause")[source]
)
) +
facet_wrap(~get(idVarName)) +
scale_color_discrete("Deaths") +
labs(x="Week",
y="Deaths",
title=paste0("Deaths by ", idVarName, " by week"),
subtitle=paste0("Week 1 is the first epi week of ", min(curYear))
)
print(p3)
}
# STEP n: Return the excess file
excessFull
}
clNames <- clusterList_temp$useClusters %>% unique()
clValues <- lapply(clNames, FUN=function(x) names(clusterList_temp$useClusters)[clusterList_temp$useClusters==x])
clusterAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="state",
critSubsets=clValues,
startWeek=1,
subListNames=clNames,
critListNames=paste0("cluster ", clNames),
factorCritList=FALSE,
popData=clusterList_temp$deaths %>%
group_by(cluster) %>%
summarize(pop=max(pop), .groups="drop"),
cvDeathData=clusterList_temp$deaths %>%
select(cluster, year, week, cvDeaths=new_deaths),
idVarName="cluster"
)
stateAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="state",
critSubsets=setdiff(names(clusterList_temp$useClusters), "NC"),
startWeek=1,
idVarName="state",
subListNames=setdiff(names(clusterList_temp$useClusters), "NC"),
showAllPlots=FALSE
)
stateAgg_temp
## # A tibble: 16,500 x 10
## state year week deaths weekfct yearint pred delta cumDelta cumPred
## <fct> <fct> <int> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AK 2015 1 62 1 2015 69.7 -7.71 -7.71 69.7
## 2 AK 2016 1 60 1 2016 71.4 -11.4 -11.4 71.4
## 3 AK 2017 1 85 1 2017 73 12.0 12.0 73
## 4 AK 2018 1 71 1 2018 74.6 -3.65 -3.65 74.6
## 5 AK 2019 1 87 1 2019 76.3 10.7 10.7 76.3
## 6 AK 2020 1 77 1 2020 77.9 -0.938 -0.938 77.9
## 7 AK 2021 1 103 1 2021 79.6 23.4 23.4 79.6
## 8 AK 2015 2 77 2 2015 73.1 3.89 -3.82 143.
## 9 AK 2016 2 65 2 2016 74.8 -9.75 -21.1 146.
## 10 AK 2017 2 77 2 2017 76.4 0.600 12.6 149.
## # ... with 16,490 more rows
ageAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="age",
critSubsets=levels(cdc_temp$age),
startWeek=1,
idVarName="age",
subListNames=levels(cdc_temp$age),
showAllPlots=TRUE
)
ageAgg_temp
## # A tibble: 1,980 x 10
## age year week deaths weekfct yearint pred delta cumDelta cumPred
## <fct> <fct> <int> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Under 25 yea~ 2015 1 1027 1 2015 1099. -72.4 -72.4 1099.
## 2 Under 25 yea~ 2016 1 1027 1 2016 1079. -51.9 -51.9 1079.
## 3 Under 25 yea~ 2017 1 1107 1 2017 1058. 48.6 48.6 1058.
## 4 Under 25 yea~ 2018 1 1142 1 2018 1038. 104. 104. 1038.
## 5 Under 25 yea~ 2019 1 989 1 2019 1017. -28.4 -28.4 1017.
## 6 Under 25 yea~ 2020 1 1067 1 2020 997. 70.1 70.1 997.
## 7 Under 25 yea~ 2021 1 994 1 2021 976. 17.6 17.6 976.
## 8 Under 25 yea~ 2015 2 1050 2 2015 1090. -39.6 -112. 2189.
## 9 Under 25 yea~ 2016 2 1036 2 2016 1069. -33.1 -85.0 2148.
## 10 Under 25 yea~ 2017 2 1125 2 2017 1049. 76.4 125. 2107
## # ... with 1,970 more rows
The function helperKeyStateExcessPlots() is updated:
# Create plots for state-level excess deaths
helperKeyStateExcessPlots <- function(df,
epiMonth,
cvDeaths,
subT,
startWeek,
cvDeathDate,
curYear=2020:2021,
popData=select(usmap::statepop, state=abbr, pop=pop_2015)
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed state-level excess deaths data
# epiMonth: a mapping file with columns ew-month-quarter that maps an epiweek to a month/quarter
# cvDeaths: a file containing cv deaths that includes state-date-name-vpm
# subT: the subtitle to be used for the plots 9describe the vintage of the CDC data)
# startWeek: the starting week for the analysis (data from min(curYear) is deleted if week < startWeek)
# cvDeathDate: the date to use for puling deaths from the cvDeaths file
# curYear: the current year(s)
# popData: a file containing only state-pop, with population by state
# STEP 0: Create a general plotting database with quarter and month
plotData <- df %>%
filter(year %in% curYear) %>%
left_join(epiMonth, by=c("week"="ew")) %>%
mutate(quarter=factor(paste0(year, "-", "Q", quarter)),
postStart=ifelse(week>=startWeek | yearint>min(curYear), 1, 0)
) %>%
group_by(state, yearint, quarter, month, postStart) %>%
summarize(excess=sum(delta), .groups="drop") %>%
left_join(popData, by="state") %>%
mutate(excesspm=excess*1000000/pop)
# STEP 1: Plot of excess deaths by quarter
p1 <- plotData %>%
group_by(state, quarter) %>%
summarize(excess=sum(excess), .groups="drop") %>%
ggplot(aes(x=fct_reorder(state, excess, .fun=sum), y=excess/1000, fill=fct_rev(quarter))) +
geom_col(position="stack") +
coord_flip() +
labs(x="State",
y="Excess Deaths (000s)",
title=paste0("All-cause excess deaths in ",
if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
),
subtitle=subT
) +
scale_fill_discrete("Quarter")
print(p1)
# STEP 2: Plot of excess deaths per million by state by quarter
p2 <- plotData %>%
group_by(state, quarter) %>%
summarize(excesspm=sum(excesspm), .groups="drop") %>%
ggplot(aes(x=fct_reorder(state, excesspm, .fun=sum), y=excesspm, fill=fct_rev(quarter))) +
geom_col(position="stack") +
coord_flip() +
labs(x="State",
y="Excess Deaths (per million people)",
title=paste0("All-cause excess deaths per million people in ",
if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
),
subtitle=subT
) +
scale_fill_discrete("Quarter")
print(p2)
# STEP 3: Plot of excess deaths by epi-month
pes_001 <- plotData %>%
filter(postStart==1) %>%
ggplot(aes(x=fct_reorder(state, excesspm, .fun=sum))) +
geom_col(aes(y=excesspm, fill=fct_rev(paste0(yearint, "-", match(month, month.abb) %>% zeroPad2()))),
position="stack"
) +
coord_flip() +
labs(x="State",
y="Excess Deaths (per million people)",
title=paste0("All-cause excess deaths per million people in ",
if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
),
subtitle=subT,
caption="Epi week converted to month based on most frequent month in epi week"
) +
scale_fill_discrete("Epi Month")
print(pes_001)
# STEP 4: Layer on the reported coronavirus deaths data
cumDeath <- cvDeaths %>%
filter(name=="deaths", state != "cluster") %>%
group_by(state) %>%
mutate(cumdpm=cumsum(vpm)) %>%
filter(date %in% c(cvDeathDate, max(date)))
pes_002 <- pes_001 +
geom_point(data=filter(cumDeath, date==cvDeathDate), aes(x=fct_reorder(state, cumdpm), y=cumdpm), size=2) +
labs(title=paste0("Deaths per million people using data through ", format(cvDeathDate, "%b %d, %Y")),
subtitle="Bars based on excess CDC all-cause deaths, points are reported coronavirus deaths",
y="Deaths (per million people)"
)
print(pes_002)
# STEP 5: Show ratio of estimated all-cause deaths to reported coronavirus deaths
p5 <- plotData %>%
filter(postStart==1) %>%
group_by(state) %>%
summarize(excesspm=sum(excesspm)) %>%
inner_join(select(filter(cumDeath, date==cvDeathDate), state, cumdpm)) %>%
mutate(rat=excesspm/cumdpm) %>%
ggplot(aes(x=fct_reorder(state, rat), y=rat)) +
geom_col(fill="lightblue") +
geom_text(aes(y=rat/2, label=paste0(round(100*rat), "%"))) +
coord_flip() +
labs(x="State",
y="Ratio of estimated all-cause excess deaths to reported coronavirus deaths",
title="Ratio of estimated all-cause excess deaths to reported coronavirus deaths",
subtitle=paste0("All data through ", format(cvDeathDate, "%b %d, %Y"))
) +
ylim(c(0, NA)) +
geom_hline(aes(yintercept=1), lty=2)
print(p5)
# STEP n: Return the plotting data
plotData
}
weekThru <- 17
curYear <- 2020:2021
cvDeathThru <- "2021-05-26"
stateExcessPlot_temp <- helperKeyStateExcessPlots(df=stateAgg_temp,
epiMonth=readFromRDS("epiMonth"),
cvDeaths=readFromRDS("cdc_daily_210528")$plotDataList$dfFull %>%
select(state, date, vpm=dpm) %>%
mutate(name="deaths") %>%
filter(!is.na(vpm)),
startWeek=10,
cvDeathDate=as.Date(cvDeathThru),
subT=paste0("CDC data through week ",
weekThru,
" of ",
max(curYear)
),
curYear=curYear,
popData=readFromRDS("statePop2019") %>%
select(state=stateAbb, pop=pop_2019)
)
## Joining, by = "state"
The function helperKeyAgeExcessPlots() is updated:
# Create plots for age-cohort excess deaths
helperKeyAgeExcessPlots <- function(df,
epiMonth,
popData,
subT,
startWeek,
curYear=2020:2021
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed state-level excess deaths data
# epiMonth: a mapping file with columns ew-month-quarter that maps an epiweek to a month/quarter
# popData: a file containing only age-pop, with population by age cohort
# subT: the subtitle to be used for the plots 9describe the vintage of the CDC data)
# startWeek: the starting week for the analysis (only data in year==curYear with week>=startWeek used)
# curYear: the current year
# STEP 0: Create a general plotting database with quarter and month
plotData <- df %>%
filter(year %in% curYear) %>%
left_join(epiMonth, by=c("week"="ew")) %>%
mutate(quarter=factor(paste0(year, "-", "Q", quarter)),
postStart=ifelse(week>=startWeek | yearint>min(curYear), 1, 0)
) %>%
group_by(age, yearint, quarter, month, postStart) %>%
summarize(excess=sum(delta), .groups="drop") %>%
left_join(popData, by="age") %>%
mutate(excesspm=excess*1000000/pop)
# STEP 1: Plot of excess deaths by quarter
p1 <- plotData %>%
group_by(age, quarter) %>%
summarize(excess=sum(excess), .groups="drop") %>%
ggplot(aes(x=fct_reorder(age, excess, .fun=sum), y=excess/1000, fill=fct_rev(quarter))) +
geom_col(position="stack") +
coord_flip() +
labs(x="Age Group",
y="Excess Deaths (000s)",
title=paste0("All-cause excess deaths in ",
if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
),
subtitle=subT
) +
scale_fill_discrete("Quarter")
print(p1)
# STEP 2: Plot of excess deaths by epi-month
pea_001 <- plotData %>%
filter(postStart==1) %>%
ggplot(aes(x=fct_reorder(age, excesspm, .fun=sum))) +
geom_col(aes(y=excesspm, fill=fct_rev(paste0(yearint, "-", match(month, month.abb) %>% zeroPad2()))),
position="stack"
) +
coord_flip() +
labs(x="Age Group",
y="Excess Deaths (per million people)",
title=paste0("All-cause excess deaths in ",
if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
),
subtitle=subT,
caption="Epi week converted to month based on most frequent month in epi week"
) +
geom_text(data=. %>% group_by(age) %>% summarize(excesspm=sum(excesspm)),
aes(y=excesspm+100, label=round(excesspm), hjust=0)
) +
scale_fill_discrete("Epi Month")
print(pea_001)
# STEP 3: Return the plotting data
plotData
}
curYear <- 2020:2021
ageExcessPlot_temp <- helperKeyAgeExcessPlots(df=ageAgg_temp,
epiMonth=readFromRDS("epiMonth"),
popData=readFromRDS("usPopBucket2020"),
startWeek=10,
subT=paste0("CDC data through week ",
weekThru,
" of ",
max(curYear)
),
curYear=curYear
)
The function cdcAggregateSummary() is updtaed to allow for piping detailed summaries to pdf:
# Function to create data and plots for an aggregate (cluster in this case)
cdcAggregateSummary <- function(df,
critVar,
critSubsets,
startWeek,
idVarName=critVar,
curYear=2020:2021,
startYear=2015,
subListNames=NULL,
critListNames=subListNames,
factorCritList=!is.null(critListNames),
popData=NULL,
cvDeathData=NULL,
showAllPlots=TRUE,
showStep1Plots=showAllPlots,
showStep3Plots=showAllPlots,
pdfFile=NULL
) {
# FUNCTION ARGUMENTS:
# df: the data frame containing the processed CDC all-cause deaths data
# critVar: the main variable supplied to critFilter
# critSubsets: the subsets used for critVar
# startWeek: the starting week for analysis (will assume no excess deaths prior to startWeek or curYear)
# idVarName: name to be used for the identifying variable when excess deaths data are combined
# curYear: current year(s) of CDC deaths data
# subListNames: names to be used in place of numbers for idVarName column of excessFull
# (NULL means use numbers)
# critListNames: names to be used for describing the key cohorts (NULL means use numbers)
# factorCritList: boolean, whether to factor the variable that results from critListNames
# startYear: starting year of CDC deaths data
# popData: population data file, unique and matching by idVarName (NULL means no population plots)
# cvDeathData: coronavirus deaths data file, unique and matching by idVarName
# NULL means no plots of excess all-cause deaths vs reported coronavirus deaths
# showAllPlots: boolean, whether to show all the plots
# showStep1Plots: boolean, whether to show a 1-page plot summary for each element of critSubsets
# showStep3Plots: boolean, whether to show a 1-page cross-element summary
# pdfFile: a path for the detailed step 1 plots to be stored as PDF (NULL means store in the log)
# STEP 0: Initialize list to store results for each combination of critSubsets
tmpList <- vector("list", length(critSubsets))
# STEP 1: Run the process for each combination of critSubsets (produce plots if showStep1Plots is TRUE)
# Redirect detailed plots to PDF if requested
if (!is.null(pdfFile)) {
cat("\nDetailed", idVarName, "summary PDF file is available at:", pdfFile, "\n")
pdf(pdfFile, width=11)
}
# Run the detailed sumaries for each level
for (ctr in seq_along(critSubsets)) {
critUse <- list(critSubsets[[ctr]])
names(critUse) <- critVar
useName <- if(is.null(critListNames)) paste0("Iteration: ", ctr) else critListNames[ctr]
tmpList[[ctr]] <- cdcCohortAnalysis(cohortName=useName,
df=df,
critFilter=critUse,
plotTitle="",
curYear=curYear,
startYear=startYear,
startWeek=startWeek,
showSubsetPlots=FALSE,
showPredActualPlots=showStep1Plots,
predActualPlotsOnePage=showStep1Plots
)
}
# Redirect all other output to the main log
if (!is.null(pdfFile)) {
cat("\nReturning plot outputs to the main log file\n")
dev.off()
}
# STEP 1a: Add names if requeated
if (!is.null(subListNames)) names(tmpList) <- subListNames
# STEP 2: Create the excess file
excessFull <- map_dfr(.x=tmpList, .f=function(x) { x[["preds"]] }, .id=idVarName)
# STEP 2a: Convert idVarName to factor if critListName is passed
if (factorCritList) {
excessFull <- excessFull %>%
mutate(!!idVarName:=factor(get(idVarName), levels=critListNames))
}
# STEP 3: Produce a plot of excess deaths by week by idVarName
if (showStep3Plots) {
p1 <- excessFull %>%
filter(year %in% curYear) %>%
ggplot(aes(x=week, y=delta)) +
geom_line(aes_string(group=idVarName, color=idVarName)) +
labs(title=paste0("Actual minus predicted deaths by week (",
if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear,
")"
),
x="Week",
y="Actual minus predicted deaths"
) +
facet_wrap(~year) +
scale_color_discrete(idVarName)
print(p1)
}
# STEP 4: Augment the data to include population (if popData is not NULL) and plot "per million"
if (!is.null(popData)) {
# Add population data to excessFull (only variables should be idVarName and pop)
excessFull <- excessFull %>%
left_join(select_at(popData, vars(all_of(c(idVarName, "pop")))), by=idVarName)
# Plot of excess deaths per million by week by cluster
p2 <- excessFull %>%
filter(year %in% curYear) %>%
ggplot(aes(x=week, y=1000000*delta/pop)) +
geom_line(aes(group=cluster, color=cluster)) +
labs(title=paste0("Actual minus predicted deaths by week (",
if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear,
")"
),
subtitle="Per million people, per week",
x="Week",
y="Actual minus predicted deaths (per million people)"
) +
scale_color_discrete(idVarName) +
facet_wrap(~year) +
geom_hline(aes(yintercept=0), lty=2)
print(p2)
}
# STEP 5: Augment the data to show plot of excess all-cause deaths vs. reported coronavirus deaths
if (!is.null(cvDeathData)) {
# Add the reported coronavirus deaths total
excessFull <- excessFull %>%
left_join(select_at(cvDeathData, vars(all_of(c(idVarName, "year", "week", "cvDeaths")))),
by=c(idVarName, "yearint"="year", "week")
) %>%
mutate(cvDeaths=ifelse(is.na(cvDeaths) | !(year %in% curYear), 0, cvDeaths))
# Plot for total excess deaths and reported coronavirus deaths by idVarName by week
p3 <- excessFull %>%
filter(year %in% curYear) %>%
mutate(useWeek=week+53*(yearint-min(curYear))) %>%
select_at(vars(all_of(c(idVarName, "year", "week", "useWeek", "cvDeaths", "delta")))) %>%
pivot_longer(c(cvDeaths, delta), names_to="source", values_to="deaths") %>%
ggplot(aes(x=useWeek, y=deaths)) +
geom_line(aes(group=source,
color=c("cvDeaths"="Reported COVID", "delta"="Excess All-Cause")[source]
)
) +
facet_wrap(~get(idVarName)) +
scale_color_discrete("Deaths") +
labs(x="Week",
y="Deaths",
title=paste0("Deaths by ", idVarName, " by week"),
subtitle=paste0("Week 1 is the first epi week of ", min(curYear))
)
print(p3)
}
# STEP n: Return the excess file
excessFull
}
clNames <- clusterList_temp$useClusters %>% unique()
clValues <- lapply(clNames, FUN=function(x) names(clusterList_temp$useClusters)[clusterList_temp$useClusters==x])
clusterAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="state",
critSubsets=clValues,
startWeek=1,
subListNames=clNames,
critListNames=paste0("cluster ", clNames),
factorCritList=FALSE,
popData=clusterList_temp$deaths %>%
group_by(cluster) %>%
summarize(pop=max(pop), .groups="drop"),
cvDeathData=clusterList_temp$deaths %>%
select(cluster, year, week, cvDeaths=new_deaths),
idVarName="cluster",
showAllPlots=TRUE,
pdfFile="./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf"
)
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf
##
## Returning plot outputs to the main log file
stateAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="state",
critSubsets=setdiff(names(clusterList_temp$useClusters), "NC"),
startWeek=1,
idVarName="state",
subListNames=setdiff(names(clusterList_temp$useClusters), "NC"),
showAllPlots=FALSE
)
stateAgg_temp
## # A tibble: 16,500 x 10
## state year week deaths weekfct yearint pred delta cumDelta cumPred
## <fct> <fct> <int> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AK 2015 1 62 1 2015 69.7 -7.71 -7.71 69.7
## 2 AK 2016 1 60 1 2016 71.4 -11.4 -11.4 71.4
## 3 AK 2017 1 85 1 2017 73 12.0 12.0 73
## 4 AK 2018 1 71 1 2018 74.6 -3.65 -3.65 74.6
## 5 AK 2019 1 87 1 2019 76.3 10.7 10.7 76.3
## 6 AK 2020 1 77 1 2020 77.9 -0.938 -0.938 77.9
## 7 AK 2021 1 103 1 2021 79.6 23.4 23.4 79.6
## 8 AK 2015 2 77 2 2015 73.1 3.89 -3.82 143.
## 9 AK 2016 2 65 2 2016 74.8 -9.75 -21.1 146.
## 10 AK 2017 2 77 2 2017 76.4 0.600 12.6 149.
## # ... with 16,490 more rows
ageAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))),
critVar="age",
critSubsets=levels(cdc_temp$age),
startWeek=1,
idVarName="age",
subListNames=levels(cdc_temp$age),
showAllPlots=TRUE,
pdfFile="./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf"
)
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf
##
## Returning plot outputs to the main log file
ageAgg_temp
## # A tibble: 1,980 x 10
## age year week deaths weekfct yearint pred delta cumDelta cumPred
## <fct> <fct> <int> <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Under 25 yea~ 2015 1 1027 1 2015 1099. -72.4 -72.4 1099.
## 2 Under 25 yea~ 2016 1 1027 1 2016 1079. -51.9 -51.9 1079.
## 3 Under 25 yea~ 2017 1 1107 1 2017 1058. 48.6 48.6 1058.
## 4 Under 25 yea~ 2018 1 1142 1 2018 1038. 104. 104. 1038.
## 5 Under 25 yea~ 2019 1 989 1 2019 1017. -28.4 -28.4 1017.
## 6 Under 25 yea~ 2020 1 1067 1 2020 997. 70.1 70.1 997.
## 7 Under 25 yea~ 2021 1 994 1 2021 976. 17.6 17.6 976.
## 8 Under 25 yea~ 2015 2 1050 2 2015 1090. -39.6 -112. 2189.
## 9 Under 25 yea~ 2016 2 1036 2 2016 1069. -33.1 -85.0 2148.
## 10 Under 25 yea~ 2017 2 1125 2 2017 1049. 76.4 125. 2107
## # ... with 1,970 more rows
Next steps are to integrate the functions in readRunCDCAllCause():
# Function to read and run the CDC all-cause deaths analysis
readRunCDCAllCause <- function(loc,
weekThru,
lst,
startYear=2015,
curYear=2021,
startWeek=1,
cvDeathThru=NULL,
epiMap=readFromRDS("epiMonth"),
agePopData=readFromRDS("usPopBucket2020"),
cdcPlotStartWeek=startWeek,
periodKeep=c("2015-2019", "2020"),
dir="./RInputFiles/Coronavirus/",
url="https://data.cdc.gov/api/views/y5bj-9g5w/rows.csv?accessType=DOWNLOAD",
dlData=isFALSE(file.exists(paste0(dir, loc))),
stateNoCheck=c()
) {
# FUNCTION ARGUMENTS:
# loc: the CDC .csv file name (without path)
# weekThru: how many weeks of the current year are the data valid thru?
# lst: a state clustering process output list
# startYear: the starting year in the CDC data
# curYear: the current analyis year in the CDC data
# startWeek: the starting week to use for cumulative sum of difference in expected all-cause deaths
# cvDeathThru: the date to use for pulling the CV death data (NULL means infer from curYear and startWeek)
# epiMap: a mapping file of ew-month-quarter that mas epiweek (ew) to an appropriate month and quarter
# agePopData: data containing US population as age (fct) - pop (int)
# cdcPlotStartWeek: start week for CDC plots (10 is March which avoids a 1-week February outlier)
# periodKeep: the period of previous data in the CDC all-cause deaths file (should be kept regardless)
# dir: the CDC .csv directory (will use paste0(dir, loc) as the file location)
# url: the url for downloading CDC data
# dlData: should CDC data be downloaded? default is to not overwrite an existing file
# stateNoCheck: vector of states that should not error out for failing suppression checks
# STEP 0: Download CDC data if requested
if (dlData) fileDownload(fileName=paste0(dir, loc), url=url)
# Get cvDeathDate if it has been passed as NULL
if (is.null(cvDeathThru)) {
cvDeathThru <- tibble::tibble(date=seq.Date(as.Date(paste0(max(curYear), "-01-01")),
as.Date(paste0(max(curYear), "-12-31")),
by="days"
),
epiWeek=lubridate::epiweek(date)
) %>%
filter(epiWeek==weekThru) %>%
pull(date) %>%
max() %>%
as.character()
cat("\nParameter cvDeathThru has been set as:", cvDeathThru, "\n")
}
# STEP 1: Read and process the CDC data
cdc <- readProcessCDC(loc, weekThru=weekThru, periodKeep=periodKeep, fDir=dir, stateNoCheck=stateNoCheck)
# STEP 2: Create the key data required for using state-level clusters
clusterList <- helperKeyStateClusterMetrics(lst)
# STEP 3: Generate plots of the processed CDC data
cdcBasicPlots(cdc, clustVec=clusterList$clData, stateExclude=stateNoCheck, p5Years=2019:curYear)
# Exclude stateNoCheck states
cat("\nPlots will be run after excluding stateNoCheck states\n")
cdcUse <- cdc %>% filter(!(state %in% stateNoCheck))
# STEP 4: Full US excess deaths
list_allUS <- cdcCohortAnalysis(cohortName="all US",
df=cdcUse,
curYear=2020:curYear,
startYear=startYear,
startWeek=startWeek,
plotTitle="All-cause US total deaths",
predActualPlotsOnePage=TRUE
)
# cat("\n\n\n*** THROUGH STEP 4 ***\n\n\n")
# STEP 5: Generate cluster-level aggregate plots
clNames <- clusterList$useClusters %>% unique()
clValues <- lapply(clNames, FUN=function(x) names(clusterList$useClusters)[clusterList$useClusters==x])
clusterAgg <- cdcAggregateSummary(df=cdcUse,
critVar="state",
critSubsets=clValues,
startWeek=startWeek,
subListNames=clNames,
critListNames=paste0("cluster ", clNames),
factorCritList=FALSE,
popData=clusterList$deaths %>%
group_by(cluster) %>%
summarize(pop=max(pop), .groups="drop"),
cvDeathData=clusterList$deaths %>%
select(cluster, year, week, cvDeaths=new_deaths),
idVarName="cluster",
pdfFile=NULL,
showAllPlots=TRUE
)
# cat("\n\n\n*** THROUGH STEP 5 ***\n\n\n")
# print(clusterAgg)
# print(cdcUse)
# print(clusterList$useClusters)
# STEP 6: Generate state-level aggregate data, then plot
stateAgg <- cdcAggregateSummary(df=cdcUse,
critVar="state",
critSubsets=setdiff(names(clusterList$useClusters), stateNoCheck),
startWeek=startWeek,
idVarName="state",
subListNames=setdiff(names(clusterList$useClusters), stateNoCheck),
showAllPlots=FALSE
)
# cat("\n\n\n*** THROUGH STEP 6a ***\n\n\n")
# print(stateAgg)
helperKeyStateExcessPlots(df=stateAgg,
epiMonth=epiMap,
cvDeaths=lst$plotDataList$dfFull %>%
select(state, date, vpm=dpm) %>%
mutate(name="deaths") %>%
filter(!is.na(vpm)),
startWeek=cdcPlotStartWeek,
cvDeathDate=as.Date(cvDeathThru),
subT=paste0("CDC data through week ",
weekThru,
" of ",
max(curYear)
),
curYear=2020:curYear,
popData=readFromRDS("statePop2019") %>%
select(state=stateAbb, pop=pop_2019)
)
# cat("\n\n\n*** THROUGH STEP 6b ***\n\n\n")
# STEP 7: Generate age-level aggregate data, then plot
ageAgg <- cdcAggregateSummary(df=cdcUse,
critVar="age",
critSubsets=levels(cdc$age),
startWeek=startWeek,
idVarName="age",
subListNames=levels(cdc$age),
showAllPlots=TRUE,
pdfFile=NULL
)
helperKeyAgeExcessPlots(df=ageAgg,
epiMonth=epiMap,
popData=agePopData,
startWeek=cdcPlotStartWeek,
subT=paste0("CDC data through week ",
weekThru,
" of ",
max(curYear)
),
curYear=2020:curYear
)
# STEP 8: Return a list of key files
list(cdc=cdc,
clusterList=clusterList,
allUSAgg=list_allUS$preds,
clusterAgg=clusterAgg,
stateAgg=stateAgg,
ageAgg=ageAgg
)
}
The function is then tested on previously downloaded data:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210623.csv"
cdcList_20210703 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=17,
lst=readFromRDS("cdc_daily_210528"),
dlData=FALSE,
stateNoCheck=c("NC")
)
##
## Parameter cvDeathThru has been set as: 2021-05-01
##
##
## *** Data suppression checks ***
## # A tibble: 2 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE NC TRUE FALSE 72 NA
## 2 TRUE NC TRUE TRUE 6 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 72
## 2 TRUE TRUE 6
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 91,537
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10735 0 369164
## 2 25-44 years 13656 0 902390
## 3 45-64 years 16793 0 3549786
## 4 65-74 years 16783 0 3558139
## 5 75-84 years 16790 0 4401133
## 6 85 years and older 16780 0 5681860
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691180
## 2 2015-2019 2016 Predicted (weighted) 14445 0 2723236
## 3 2015-2019 2017 Predicted (weighted) 14404 0 2801986
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830372
## 5 2015-2019 2019 Predicted (weighted) 14415 0 2844025
## 6 2020 2020 Predicted (weighted) 14837 0 3433405
## 7 2021 2021 Predicted (weighted) 4672 0 1138268
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72028 0 13890799
## 2 2020 <NA> 14837 0 3433405
## 3 2021 <NA> 4672 0 1138268
##
##
## Checking variable combination: period Note
## # A tibble: 9 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72028 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only ~ 13194 0 2.96e6
## 3 2020 Data in recent weeks are incomplete. Only ~ 531 0 2.31e5
## 4 2020 Weighted numbers of deaths are 20% or more~ 280 0 6.00e4
## 5 2020 Weights may be too low to account for unde~ 18 0 9.85e3
## 6 2020 <NA> 814 0 1.69e5
## 7 2021 Data in recent weeks are incomplete. Only ~ 4469 0 1.10e6
## 8 2021 Data in recent weeks are incomplete. Only ~ 14 0 9.65e2
## 9 2021 Data in recent weeks are incomplete. Only ~ 189 0 3.58e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
## Joining, by = "state"
Function readRunCDCAllCause has updated defaults:
Next steps are to enable the PDF printing capability as an option in the main function.